Cost-sensitive Web-based Information Acquisition for Record Matching

Authors

YEE FAN TAN

Kazunari Sugiyama

Jin Zhao

Ziheng Lin

Jesse Prabawa Gozali

Jun Ping Ng

Aobo Wang

Cong Duy Vu Hoang

Emma Thuy Dung Nguyen

Minh Thang Luong

Yee Seng Chan

Wei Lu

Victor Goh

Huaxin Xu

Gang Wang

Yantao Zheng

Zhaoyan Ming

Abstract

In many record matching problems, the input data is either ambiguous or incomplete, making the record matching task difficult. However, for some domains, evidence for record matching decisions is readily available in large quantities on the Web. These resources may be retrieved by making queries to a search engine, making the Web a valuable resource. On the other hand, Web resources are slow to acquire compared to data that is already available in the input. Also, some Web resources must be acquired before others. Hence, it is necessary to acquire Web resources selectively and judiciously, while satisfying the acquisition dependencies between these resources. This thesis has two major goals: (1) to establish that acquisition of web-based resources can benefit the performance of record matching tasks, and (2) to propose an algorithm for selective acquisition of web-based resources for record matching tasks. The algorithm should balance acquisition costs and acquisition benefits, while taking acquisition dependencies between resources into account. This thesis has two major parts corresponding to the two goals. In the first part, I propose methods for using information from the Web for three different record matching problems, namely, author name disambiguation, linkage of short forms to long forms, and web people search. Thus, I establish that acquiring web-based resources can improve record matching tasks. In the second and larger part, I propose approaches for selective acquisition of web-based resources for record matching tasks, with the aim of balancing acquisition costs and acquisition benefits. These approaches start from the more task-specific and move towards the more general and principled. I first propose a way of adaptively combining two methods for record matching, followed by a cost-sensitive attribute value acquisition algorithm for support vector machines. This work culminates in a framework for cost-sensitive resource acquisition problems with hierarchical dependencies, which is the main contribution of this thesis. This graphical framework is versatile and can apply to a large variety of problems. In the context of this framework, I propose an effective resource acquisition algorithm for record matching problems, taking particular characteristics of such problems into account. Finally, I propose two benefit functions for use in my framework, corresponding to two different evaluation measures.


Similar resources

A Framework for Hierarchical Cost-sensitive Web Resource Acquisition

Many record matching problems involve information that is insufficient or incomplete, and thus solutions that classify which pairs of records are matches often involve acquiring additional information at some cost. For example, web resources impose extra query or download time. As the amount of resources that can be acquired is large, solutions invariably acquire only a subset of the resources ...


A procedure for Web Service Selection Using WS-Policy Semantic Matching

In general, policy-based approaches play an important role in the management of web services, for instance, in the choice of semantic web services and, in particular, quality of service (QoS). The present research work illustrates a procedure for web service selection among functionally similar web services based on WS-Policy semantic matching. In this study, the procedure of WS-Policy publi...


Centralized Clustering Method To Increase Accuracy In Ontology Matching Systems

Ontology is the main infrastructure of the Semantic Web, which provides facilities for integration, searching and sharing of information on the web. The development of ontologies as the basis of the Semantic Web, and their heterogeneity, have led to the need for ontology matching. With the emergence of large-scale ontologies in real domains, ontology matching systems have faced problems such as memory con...


Towards Scalable Real-Time Entity Resolution using a Similarity-Aware Inverted Index Approach

Most research into entity resolution (also known as record linkage or data matching) has concentrated on the quality of the matching results. In this paper, we focus on matching time and scalability, with the aim to achieve large-scale real-time entity resolution. Traditional entity resolution techniques have assumed the matching of two static databases. In our networked and online world, howev...


Information Security of Web-based Systems of the Iran Public Libraries Foundation

Purpose: This paper aims to evaluate the security of web-based information systems of the Iran Public Libraries Foundation (IPLF). Methodology: A survey method was used. The data collection tool was a questionnaire based on the ISO/IEC 27002 standard, with eleven indicators and 79 sub-criteria, which examines the security of web-based information systems of IP...


Publication date: 2011
